\title{Classes of Inference}

{{navbar}}

\subsubsection{Classes of Inference}

Inference is broadly classified under three classes: variational
inference, Monte Carlo, and exact inference.
We highlight how to use inference algorithms from each class.

As an example, we assume a mixture model with latent mixture
assignments \texttt{z}, latent cluster means \texttt{beta}, and
observations \texttt{x}:
\begin{equation*}
p(\mathbf{x}, \mathbf{z}, \beta)
=
\text{Normal}(\mathbf{x} \mid \beta_{\mathbf{z}}, \mathbf{I})
~
\text{Categorical}(\mathbf{z}\mid \pi)
~
\text{Normal}(\beta\mid \mathbf{0}, \mathbf{I}).
\end{equation*}

\subsubsection{Variational Inference}

In variational inference, the idea is to posit a family of approximating
distributions and to find the closest member in the family to the
posterior \citep{jordan1999introduction}.
We write an approximating family,
\begin{align*}
q(\beta;\mu,\sigma) &= \text{Normal}(\beta; \mu,\sigma), \\[1.5ex]
q(\mathbf{z};\pi) &= \text{Categorical}(\mathbf{z};\pi),
\end{align*}
using TensorFlow variables to represent its parameters
$\lambda=\{\pi,\mu,\sigma\}$.
\begin{lstlisting}[language=Python]
from edward.models import Categorical, Normal

qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.exp(tf.Variable(tf.zeros[K, D])))
qz = Categorical(logits=tf.Variable(tf.zeros[N, K]))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_train})
\end{lstlisting}
Given an objective function, variational inference optimizes the
family with respect to \texttt{tf.Variable}s.

Specific variational inference algorithms inherit from
the \texttt{VariationalInference} class to define their own methods, such as a
loss function and gradient.
For example, we represent
MAP
estimation with an approximating family (\texttt{qbeta} and
\texttt{qz}) of \texttt{PointMass} random variables, i.e., with all
probability mass concentrated at a point.
\begin{lstlisting}[language=Python]
from edward.models import PointMass

qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = PointMass(params=tf.Variable(tf.zeros(N)))

inference = ed.MAP({beta: qbeta, z: qz}, data={x: x_train})
\end{lstlisting}
\texttt{MAP} inherits from \texttt{VariationalInference} and defines a
loss function and update rules; it uses existing optimizers inside
TensorFlow.

\subsubsection{Monte Carlo}

Monte Carlo approximates the posterior using samples
\citep{robert1999monte}. Monte Carlo is an inference where the
approximating family is an empirical distribution,
\begin{align*}
q(\beta; \{\beta^{(t)}\})
&= \frac{1}{T}\sum_{t=1}^T \delta(\beta, \beta^{(t)}), \\[1.5ex]
q(\mathbf{z}; \{\mathbf{z}^{(t)}\})
&= \frac{1}{T}\sum_{t=1}^T \delta(\mathbf{z}, \mathbf{z}^{(t)}).
\end{align*}
The parameters are $\lambda=\{\beta^{(t)},\mathbf{z}^{(t)}\}$.
\begin{lstlisting}[language=Python]
from edward.models import Empirical

T = 10000  # number of samples
qbeta = Empirical(params=tf.Variable(tf.zeros([T, K, D]))
qz = Empirical(params=tf.Variable(tf.zeros([T, N]))

inference = ed.MonteCarlo({beta: qbeta, z: qz}, data={x: x_train})
\end{lstlisting}
Monte Carlo algorithms proceed by updating one sample
$\beta^{(t)},\mathbf{(z)}^{(t)}$ at a time in the empirical approximation.
%
Markov chain Monte Carlo does this sequentially to update
the current sample (index $t$ of \texttt{tf.Variable}s) conditional on
the last sample (index $t-1$ of \texttt{tf.Variable}s).
%
Specific Monte Carlo samplers determine the update rules;
they can use gradients such as in Hamiltonian Monte Carlo
\citep{neal2011mcmc} and graph
structure such as in sequential Monte Carlo \citep{doucet2001introduction}.

\subsubsection{Non-Bayesian Methods}

As a library for probabilistic modeling (not necessarily Bayesian
modeling), Edward is agnostic to the paradigm for inference.  This
means Edward can use frequentist (population-based) inferences,
strictly point estimation, and alternative foundations for parameter
uncertainty.

For example, Edward supports non-Bayesian methods such as generative
adversarial networks (GANs)
\citep{goodfellow2014generative}.
For more details, see the \href{/tutorials/gan}{GAN tutorial}.

In general, we think opening the door to non-Bayesian approaches is a
crucial feature for probabilistic programming. This enables advances
in other fields such as deep learning to be complementary: all is in
service for probabilistic models and thus it makes sense to combine
our efforts.

\subsubsection{Exact Inference}

In order to uncover conjugacy relationships between random variables
(if they exist), we use symbolic algebra on nodes in the computational
graph.  Users can then integrate out variables to automatically derive
classical Gibbs \citep{gelfand1990sampling},
mean-field updates \citep{bishop2006pattern}, and exact inference.

For example,  can calculate a conjugate posterior analytically by
using the \texttt{ed.complete\_conditional} function:

\begin{lstlisting}[language=Python]
from edward.models import Bernoulli, Beta

# Beta-Bernoulli model
pi = Beta(1.0, 1.0)
x = Bernoulli(probs=pi, sample_shape=10)

# Beta posterior; it conditions on the sample tensor associated to x
pi_cond = ed.complete_conditional(pi)

# Generate samples from p(pi | x = NumPy array)
sess.run(pi_cond, {x: np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])})
\end{lstlisting}

\subsubsection{References}\label{references}
